Project: Ensemble Techniques
Problem statement - Term Deposit Sale
Goal:
Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
Attribute information
Input variables:
Bank client data:
Related to previous contact:
Other attributes:
Output variable (desired target):
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer #DT does not take strings as input for the model fit step....
from IPython.display import Image
#import pydotplus as pydot
from sklearn import tree
from os import system
#plt.style.use('ggplot')
pd.options.display.float_format = '{:,.2f}'.format
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
# Below we will read the data from the local folder
df = pd.read_csv("bank-full.csv")
# Now display the header
print ('Bank-Full.csv data set:')
df.head(10)
df.tail() ## to see what the end of the data looks like
df.info() # here we see the number of entries (rows and columns), the dtypes, and the non-null counts
Insight:
There are no null values, i.e., we have a value for every row and column. There are 10 variables that need to be changed from object to categorical in the coming steps.
df.shape # size of the data set, also shown in the cell above
neg_exp = df[df.pdays.lt(0)] # count the negative values present
print("The number of negative entries is", len(neg_exp.index))
# this output might be taken into consideration later in the calculations.
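As a sanity check on the `.lt(0)` filter used above, here is a minimal sketch on a toy frame (in this dataset, pdays = -1 conventionally marks clients never contacted in a previous campaign):

```python
import pandas as pd

# Toy stand-in for the bank data: pdays == -1 means the client
# was never contacted in a previous campaign.
toy = pd.DataFrame({"pdays": [-1, 5, -1, 120, -1]})
neg = toy[toy.pdays.lt(0)]
print("negative entries:", len(neg.index))
```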
df.describe().transpose() # Transpose is used here to read the attributes more easily
df.nunique() # Number of unique values in each column
# this helps to identify categorical variables.
Insights:
Based on the results above, we expect at least 6 categorical variables: those with only 2 to 4 types of entry. The variable 'job' is categorical since it relates to job types, as is 'month', since they are discrete inputs. The categorical variables are: ['job','marital', 'education','default','housing','loan','contact','month','poutcome','Target']
# Now we will get a list of unique values to evaluate how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    # if the number of unique values is less than 30, print the values; otherwise print the count
    if len(n) < 30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' + str(len(n)) + ' unique values')
        print()
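The percentage shares quoted in the insights below are read straight off value_counts(normalize=True), which returns fractions rather than raw counts. A toy example (made-up series mimicking the poutcome split):

```python
import pandas as pd

# Made-up 100-row series mimicking the poutcome distribution.
s = pd.Series(["unknown"] * 82 + ["failure"] * 11 + ["success"] * 4 + ["other"] * 3)
shares = s.value_counts(normalize=True)
print(shares["unknown"], shares["success"])  # fractions, not counts
```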
Insights:
82% of the column poutcome is unknown, so this column doesn't seem to add value to the calculations. Also, the success category is only 3%. Most of the calls were made in May and there were no calls in December. This could be a key factor to consider for a future campaign; we will see at the end of the project.
- The main contact type is cellular, with 65%.
for feature in df.columns: # Loop through all columns in the dataframe
    if df[feature].dtype == 'object': # Only apply to columns with categorical strings
        df[feature] = pd.Categorical(df[feature]) # Convert strings to the categorical dtype
print("This is the new Dtype for the dataset")
print()
print (df.dtypes)
df.head(10)
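A minimal sketch of the object-to-category conversion performed above, on a hypothetical one-column frame:

```python
import pandas as pd

# Hypothetical one-column frame with string entries.
toy = pd.DataFrame({"housing": ["yes", "no", "yes"]})
for col in toy.columns:
    if toy[col].dtype == "object":
        toy[col] = pd.Categorical(toy[col])  # object -> category dtype
print(toy["housing"].dtype)  # category
```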
df.boxplot(column="pdays",return_type='axes',figsize=(8,8))
df.boxplot(column="campaign",return_type='axes',figsize=(8,8))
df.boxplot(column="balance",return_type='axes',figsize=(8,8))
df.boxplot(column="duration",return_type='axes',figsize=(8,8))
Insight:
The boxplots above are not giving much information. The data seems to be continuous, covering a wide range of values. The pdays column appears continuous in the boxplot and may not add any value to the calculation. These boxplots will be evaluated with the histograms below.
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
sns.boxplot(x = 'age', data = df, orient = 'v', ax = ax1)
ax1.set_xlabel('People Age', fontsize=15)
ax1.set_ylabel('Age', fontsize=15)
ax1.set_title('Age Distribution', fontsize=15)
ax1.tick_params(labelsize=15)
sns.distplot(df['age'], ax = ax2)
sns.despine(ax = ax2)
ax2.set_xlabel('Age', fontsize=15)
ax2.set_ylabel('Occurrence', fontsize=15)
ax2.set_title('Age x Occurrence', fontsize=15)
ax2.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
# For the moment we can identify the variables below that have more than 2 types of entries or continuous values
no_categ_df= ['age', 'balance','day', 'duration', 'campaign', 'pdays','previous']
df[no_categ_df].hist(stacked=False, bins=50, figsize=(30,30), layout=(4,2)); # Histogram of the tentative continuous variables
#*** Please note that some of these variables can be changed to categorical inputs or dropped after the graphical evaluation.
Insights:
From the graphics above it is possible to see that campaign (number of contacts) is less than 10 and the average seen in the description above is 2. All the plots, except age and day, show strong positive skewness, where the median is greater than the mode.
df.columns # this line is to get the names of the columns to be used below
Categ_df = ['job','marital', 'education','default','housing','loan','contact','month','poutcome','Target']
for i in Categ_df: # checking value counts of the list "Categ_df"
    print(df[i].value_counts(normalize=True))
    print()
We will proceed to plot the categorical values for better visualization and decide whether to replace values (yes = 1 / no = 0) or apply one-hot encoding.
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'education', data = df[Categ_df])
ax.set_xlabel('Education Received', fontsize=16)
ax.set_ylabel('Count', fontsize=16)
ax.set_title('Education', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'marital', data = df[Categ_df])
ax.set_xlabel('Marital Status', fontsize=16)
ax.set_ylabel('Count', fontsize=16)
ax.set_title('Marital', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'job', data = df[Categ_df])
ax.set_xlabel('Types of Jobs', fontsize=16)
ax.set_ylabel('Number', fontsize=16)
ax.set_title('Job', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'poutcome', data = df[Categ_df])
ax.set_xlabel('Previous Marketing Campaign Outcome', fontsize=16)
ax.set_ylabel('Number of Previous Outcomes', fontsize=16)
ax.set_title('poutcome (Previous Marketing Campaign Outcome)', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (20,8))
sns.countplot(x = 'default', data = df[Categ_df], ax = ax1, order = ['no', 'yes'])
ax1.set_title('Default', fontsize=15)
ax1.set_xlabel('')
ax1.set_ylabel('Count', fontsize=15)
ax1.tick_params(labelsize=15)
sns.countplot(x = 'housing', data = df[Categ_df], ax = ax2, order = ['no', 'yes'])
ax2.set_title('Housing', fontsize=15)
ax2.set_xlabel('')
ax2.set_ylabel('Count', fontsize=15)
ax2.tick_params(labelsize=15)
sns.countplot(x = 'loan', data = df[Categ_df], ax = ax3, order = ['no', 'yes'])
ax3.set_title('Loan', fontsize=15)
ax3.set_xlabel('')
ax3.set_ylabel('Count', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.25)
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15,6))
sns.countplot(df[Categ_df]['contact'], ax = ax1)
ax1.set_xlabel('Contact', fontsize = 10)
ax1.set_ylabel('Count', fontsize = 10)
ax1.set_title('Contact Counts')
ax1.tick_params(labelsize=10)
sns.countplot(df[Categ_df]['month'], ax = ax2, order = ['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
ax2.set_xlabel('Months', fontsize = 10)
ax2.set_ylabel('')
ax2.set_title('Months Counts')
ax2.tick_params(labelsize=10)
plt.subplots_adjust(wspace=0.25)
#Based on the plots above, one-hot encoding will be applied to the categorical variables
##with more than 2 classifications.
##THIS IS PART OF THE QUESTION 2: PREPARING THE DATA FOR THE MODEL
replaceStruct = {
"default":{"no": 0, "yes": 1 },
"housing":{"no": 0, "yes": 1 },
"loan": {"no": 0, "yes": 1 },
"Target":{"no":0, "yes":1},
} # All boolean columns will be changed to 1 and 0
oneHotCols=["marital","education","contact","poutcome","job","month"]
df=df.replace(replaceStruct)
df=pd.get_dummies(df, columns=oneHotCols)
df.head(10)
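A small sketch (toy two-column frame) of the replace + get_dummies pipeline applied above: boolean columns map to 0/1, multi-class columns expand into indicator columns (the column names here are illustrative, not from the bank data):

```python
import pandas as pd

# Toy frame: one yes/no column, one multi-class column.
toy = pd.DataFrame({"loan": ["no", "yes"], "marital": ["single", "married"]})
toy = toy.replace({"loan": {"no": 0, "yes": 1}})   # boolean -> 0/1
toy = pd.get_dummies(toy, columns=["marital"])     # one-hot expansion
print(sorted(toy.columns))
```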
print (df.dtypes) ## This is to make sure there are no object columns left in the dataset (as happened before due to a failure in the cell above)
df.corr() # from this raw table it is hard to see any correlation.
#Another correlation methods
plt.figure(figsize=(30,60))
sns.heatmap(df.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
sns.pairplot(df, hue = 'Target') ## After several trials, this did not finish on this computer
Insights
The target shows some correlation with the months of September, October, and March. Surprisingly, it has no correlation with May, even though that was the month with the most phone calls. It also has a minor correlation with cellular phone calls, which were the main means of communication.
# Data set was corrected in previous steps
print("This is the new Dtype for the dataset")
df.dtypes
Observation
As part of the process of preparing the data, the column Target was modified in previous steps: Yes = 1 and No = 0. One-hot encoding was also done above in order to prepare the data for the model.
# from sklearn.model_selection import train_test_split << this is the library that will be used (loaded at the beginning)
X = df.drop('Target',axis=1) # Predictor feature columns (27 X m)
Y = df['Target'] # target variable (1 X m)
##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)
x_train.head()
x_train.info()
x_test.head() # this is to review the columns
y_test.head() # this is to make sure the split was done properly and that there are no strings in the column.
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
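Note that the split above does not stratify on the (imbalanced) target. A hedged sketch of the stratify option, which keeps the class ratio identical in train and test; the arrays here are synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
print(y_tr.mean(), y_te.mean())  # both 0.1: the 90/10 ratio is preserved
```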
#The following libraries will be used ( already imported at the beginning)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
# #from sklearn.linear_model import LogisticRegression
# Fit the model on train
model = LogisticRegression(random_state=1,fit_intercept=False)
model.fit(x_train, y_train)
y_predict = model.predict(x_test) # Predicting the target variable on test data
# Observe the predicted and observed classes in a dataframe.
z = x_test.copy()
z['Observed Target'] = y_test
z['Predicted Target'] = y_predict
z.head()
print("Training accuracy =",model.score(x_train,y_train))
print()
print("Testing accuracy =",model.score(x_test, y_test))
print()
print("Recall = ",recall_score(y_test,y_predict))
print()
print("Precision = ",precision_score(y_test,y_predict))
print()
print("F1 Score =",f1_score(y_test,y_predict))
print()
print("Roc Auc Score =",roc_auc_score(y_test,y_predict))
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["Observed 1","Observed 0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print('Confusion Matrix')
print(model.score(x_test, y_test))
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, model.predict(x_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Insight
The recall for this logistic regression is low (20%) whereas the precision is 56%. These numbers are still modest, and we expect to improve them with ensemble techniques.
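One way to trade precision for recall without changing the model is to lower the 0.5 decision threshold applied to predict_proba. A self-contained illustration with made-up probabilities (not this model's output):

```python
import numpy as np

# Made-up predicted probabilities and true labels (illustrative only).
proba = np.array([0.9, 0.6, 0.4, 0.35, 0.2, 0.1])
y_true = np.array([1, 1, 1, 0, 0, 0])

def recall_at(threshold):
    # Classify as positive when the predicted probability clears the threshold.
    y_pred = (proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return tp / (tp + fn)

print(recall_at(0.5), recall_at(0.3))  # recall rises as the threshold drops
```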
## Feature Importance or Coefficients
fi = pd.DataFrame()
fi['Col'] = x_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Insight
All the coefficients are less than 1; housing, contact_unknown, and poutcome_unknown are the top 3, followed by month_may. 4 out of the 5 top coefficients come from the one-hot encoding, which means it was a good choice to apply that technique to those variables. In the coming steps we can confirm whether the relevance of these variables holds in other models.
We will build our model using the DecisionTreeClassifier function, with the 'entropy' criterion to split ('gini' is the library default).
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=1)
dTree.fit(x_train, y_train)
print("Train: %.2f" % dTree.score(x_train, y_train)) # performance on train data
print("Test: %.2f" % dTree.score(x_test, y_test)) # performance on test data
### While making this project, the commands below were tried while looking for the right way to install graphviz
#!pip install Graphviz
# pip install pydotplus
#pip install six
#pip install --upgrade mglearn
#pip install mlrose
from sklearn.tree import export_graphviz
#from sklearn.externals.six import StringIO
#from six import StringIO
import six
import sys
sys.modules['sklearn.externals.six'] = six
import mlrose
from IPython.display import Image
import pydotplus
import graphviz
train_char_label = ['No', 'Yes']
Credit_Tree_File = open('credit_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(x_train), class_names = list(train_char_label))
Credit_Tree_File.close()
retCode = system("dot -Tpng credit_tree.dot -o credit_tree.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("credit_tree.png"))
Insight
It is possible to read the graph above by zooming in. It clearly shows the overfitting of the model.
clf_pruned = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
max_depth=4, min_samples_leaf=5)
clf_pruned.fit(x_train, y_train)
print(clf_pruned.score(x_train, y_train))
print(clf_pruned.score(x_test, y_test))
print("Train: %.2f" % clf_pruned.score(x_train, y_train)) # performance on train data
print("Test: %.2f" % clf_pruned.score(x_test, y_test)) # performance on test data
Insight
The result is a better fit, with good and equal performance between the test and the train data.
y_train.value_counts()
dot_data = tree.export_graphviz(clf_pruned, out_file=None, # out_file=None returns the dot source as a string
filled=True, rounded=True,
special_characters=True,feature_names = list(x_train),class_names=['No', 'Yes'])
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('df.png')
Image(graph.create_png())
## Calculating feature importance
feat_importance = clf_pruned.tree_.compute_feature_importances(normalize=False)
feat_imp_dict = dict(zip(list(x_train), clf_pruned.feature_importances_))
feat_imp = pd.DataFrame.from_dict(feat_imp_dict,orient='index')
feat_imp.sort_values(by=0, ascending=False)
print (pd.DataFrame(clf_pruned.feature_importances_, columns = ["Imp"], index = x_train.columns))
Insight
From the feature importance dataframe we can infer that duration, poutcome_success, and contact_unknown are the variables with the most impact on Target.
preds_train = clf_pruned.predict(x_train)
preds_test = clf_pruned.predict(x_test)
acc_DT = accuracy_score(y_test, preds_test)
print("Training accuracy =",clf_pruned.score(x_train,y_train))
print()
print("Testing accuracy =",clf_pruned.score(x_test, y_test))
print()
print("Recall = ",recall_score(y_test,preds_test))
print()
print("Precision = ",precision_score(y_test,preds_test))
print()
print("F1 Score =",f1_score(y_test,preds_test))
print()
print("Roc Auc Score =",roc_auc_score(y_test,preds_test))
# Confusion matrix
pd.crosstab(y_test, preds_test, rownames=['Actual'], colnames=['Predicted'])
#NO =0 and YES = 1
print(clf_pruned.score(x_test , y_test))
y_predict = clf_pruned.predict(x_test)
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
Insight
When the tree is regularised, overfitting is reduced, but there is no increase in accuracy.
# Creating a function for visualizing classifier results
from yellowbrick.classifier import ClassificationReport, ROCAUC
def visClassifierResults(model_w_parameters):
    viz = ClassificationReport(model_w_parameters)
    viz.fit(x_train, y_train)
    viz.score(x_test, y_test)
    viz.show()
    roc = ROCAUC(model_w_parameters)
    roc.fit(x_train, y_train)
    roc.score(x_test, y_test)
    roc.show()
visClassifierResults(DecisionTreeClassifier(criterion = "entropy", max_depth=4))
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame({'Method':['Decision Tree'], 'accuracy': acc_DT,
'Recall': recall_score(y_test,preds_test),
'Precision':precision_score(y_test,preds_test),
'F1 Score':f1_score(y_test,preds_test),
'Roc Auc Score':roc_auc_score(y_test,preds_test)})
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassificationReport, ROCAUC
rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(x_train, y_train)
pred_RF = rfcl.predict(x_test)
acc_RF = accuracy_score(y_test, pred_RF)
#y_predict = rfcl.predict(x_test)
print(rfcl.score(x_test, y_test))
cm=metrics.confusion_matrix(y_test, pred_RF ,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
visClassifierResults(RandomForestClassifier(n_estimators = 50))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Random Forest'], 'accuracy': [acc_RF],
'Recall': recall_score(y_test,pred_RF),
'Precision':precision_score(y_test,pred_RF),
'F1 Score':f1_score(y_test,pred_RF),
'Roc Auc Score':roc_auc_score(y_test,pred_RF)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators = 100, learning_rate=0.1, random_state=22)
abcl = abcl.fit(x_train, y_train)
pred_AB =abcl.predict(x_test)
acc_AB = accuracy_score(y_test, pred_AB)
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Adaboost'], 'accuracy': [acc_AB],
'Recall': recall_score(y_test,pred_AB),
'Precision':precision_score(y_test,pred_AB),
'F1 Score':f1_score(y_test,pred_AB),
'Roc Auc Score':roc_auc_score(y_test,pred_AB)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
y_predict = abcl.predict(x_test)
print(abcl.score(x_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
visClassifierResults(AdaBoostClassifier(n_estimators= 100, learning_rate=0.1, random_state=22))
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl = bgcl.fit(x_train, y_train)
pred_BG = bgcl.predict(x_test)
acc_BG = accuracy_score(y_test, pred_BG)
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': [acc_BG],
'Recall': recall_score(y_test,pred_BG),
'Precision':precision_score(y_test,pred_BG),
'F1 Score':f1_score(y_test,pred_BG),
'Roc Auc Score':roc_auc_score(y_test,pred_BG)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
y_predict = bgcl.predict(x_test)
print(bgcl.score(x_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
visClassifierResults (BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22))
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22)
gbcl = gbcl.fit(x_train, y_train)
pred_GB = gbcl.predict(x_test)
acc_GB = accuracy_score(y_test, pred_GB)
y_predict = gbcl.predict(x_test)
print(gbcl.score(x_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
visClassifierResults(GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22))
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Gradient Boost'], 'accuracy': [acc_GB],
'Recall': recall_score(y_test,pred_GB),
'Precision':precision_score(y_test,pred_GB),
'F1 Score':f1_score(y_test,pred_GB),
'Roc Auc Score':roc_auc_score(y_test,pred_GB)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
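The per-model metric blocks above repeat the same bookkeeping. A small helper (hypothetical name metrics_row, not used in the notebook) would keep the comparison table consistent; shown here on dummy labels:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)

def metrics_row(name, y_true, y_pred):
    # One row of the comparison table, mirroring the columns used above.
    return pd.DataFrame({"Method": [name],
                         "accuracy": [accuracy_score(y_true, y_pred)],
                         "Recall": [recall_score(y_true, y_pred)],
                         "Precision": [precision_score(y_true, y_pred)],
                         "F1 Score": [f1_score(y_true, y_pred)],
                         "Roc Auc Score": [roc_auc_score(y_true, y_pred)]})

row = metrics_row("Dummy", [0, 1, 1, 0], [0, 1, 0, 0])
print(row[["accuracy", "Recall"]])
```

Each model's row could then be concatenated with pd.concat exactly as done above.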
Insight
The main objective of this project is to design a model that helps the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio. This goal can be interpreted as the bank wanting to increase the number of positive answers once the client is contacted, for which we should identify which parameters will help us be more accurate when selecting the clients to be contacted.
Important Metric
As mentioned above, the bank wants to increase the number of people who subscribe to a term deposit, i.e., a lower number of False Negatives; if FN is high, the bank would lose the chance to increase its hit ratio. Hence Recall is the important metric in the context of this classification exercise.
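To make the FN argument concrete, here is a worked recall/precision computation with illustrative confusion-matrix counts, chosen to land near the ~20% recall and ~56% precision reported for the logistic regression earlier:

```python
# Illustrative confusion-matrix counts (not actual model output).
tp, fp, fn, tn = 300, 240, 1000, 11000
recall = tp / (tp + fn)      # share of actual subscribers we caught
precision = tp / (tp + fp)   # share of flagged clients who actually subscribe
print(round(recall, 2), round(precision, 2))
```

A large FN count dominates the recall denominator, which is exactly why reducing FN (as bagging does below) lifts Recall.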
The precision of bagging is actually the lowest. If the bank considers it inconvenient, for any reason, to offer the term deposit plan to the wrong person, then precision would play a more important role, and in that case another model could be chosen.
In the heatmaps it is possible to observe that the number of False Negatives (predicted negative and observed positive) is around 1000 for all the models except bagging, which has 792 counts. This is the reason we get better Recall with that algorithm in this particular exercise.
In this exercise one-hot encoding was applied only to the variables "job", "marital", "education", "contact", "poutcome", and "month", whereas all the boolean columns were replaced with 1 and 0. It would be a good exercise to apply one-hot encoding to all the categorical variables and compare the results of the model.
Applying one-hot encoding to all the categorical variables would also let us evaluate whether one particular classification is an important coefficient.
The command sns.pairplot(df, hue = 'Target') was the process that took the most resources from the computer (longest time to calculate) and did not finish every time it was run. In real life, where more columns are available, this step should be reviewed carefully before letting it run.